Foundations of CUDA Kernel Development
AI021 Lesson 2
00:00

CUDA kernel development begins with the definition of a kernel: a specialized C++ function designed to execute in parallel across the massive core count of an NVIDIA GPU. Kernels are the fundamental unit of work in the CUDA programming model, acting as the bridge where serial host logic transitions into massively parallel device execution.

1. The __global__ Specifier

The __global__ declaration specifier instructs the compiler to generate device code for the GPU while keeping the function's entry point callable from the CPU. Functions that execute on the GPU but can be invoked from the host are called kernels.
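A minimal sketch of such a kernel might look like the following (the function name and message are illustrative, not from the lesson):

```cuda
#include <cstdio>

// __global__ marks this function as a kernel: it is compiled for the
// device, but its entry point remains callable from host code.
__global__ void helloKernel() {
    printf("Hello from the GPU\n");
}

int main() {
    // Launch configuration: 1 block of 4 threads (covered later).
    helloKernel<<<1, 4>>>();
    cudaDeviceSynchronize();  // wait for the asynchronous kernel to finish
    return 0;
}
```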

2. Execution Environment

Kernels are dispatched to and executed on Streaming Multiprocessors (SMs). The SM is the primary computational engine within an NVIDIA GPU, responsible for managing hundreds of concurrent threads. Each SM handles blocks of threads and schedules them onto its processing cores.
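Because each SM executes whole blocks of threads, a thread typically computes its own global index from its block and thread coordinates to select its portion of the work. A minimal sketch, with illustrative names:

```cuda
// Each thread derives a unique global index and processes one element.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may be only partially full
        data[i] *= 2.0f;
    }
}
```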

Syntax Rule: Kernels must return void. Because they execute asynchronously with respect to the host, they cannot return a value directly to the CPU; instead, they must write their results back to allocated device memory.
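Putting that rule into practice, a sketch of the full round trip: the kernel returns void and writes its result into device memory, which the host then copies back (kernel and variable names are illustrative):

```cuda
#include <cstdio>

// The kernel returns void; the sum is written to device memory.
__global__ void addKernel(const int *a, const int *b, int *out) {
    *out = *a + *b;
}

int main() {
    int a = 2, b = 3, result = 0;
    int *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, 1>>>(d_a, d_b, d_out);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", result);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```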

[Diagram: a kernel launch flowing from the Host (CPU) to Streaming Multiprocessors (SMs) on the Device (NVIDIA GPU)]